A high-performance estimation algorithm for modified Kneser-Ney (MKN) n-gram models, as implemented by Kenneth Heafield and described in the seminal paper “Scalable Modified Kneser-Ney Language Model Estimation”.
It lets us control the size of the resulting model through the following variables:
Pruning: the minimum number of times an n-gram must appear in the training set to be kept.
Sample size: how much of the original data is used for training.
Case: whether to ignore case. Ignoring it should result in a smaller model if the rest of the variables remain the same.
N-gram order: a smaller ‘n’ results in a smaller n-gram model.
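
As a rough illustration of how these four variables affect model size, the Python sketch below counts the distinct n-grams that survive each setting. It is not the MKN estimator and performs no smoothing; the function name count_ngrams, its parameters (order, min_count, sample_rate, lowercase, seed) and the toy corpus are all hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def count_ngrams(lines, order=3, min_count=1, sample_rate=1.0,
                 lowercase=False, seed=0):
    """Count n-grams of length 1..order and drop the rare ones.

    Hypothetical helper for illustration only; it does not estimate
    Kneser-Ney probabilities.
    """
    rng = random.Random(seed)
    counts = Counter()
    for line in lines:
        # Sample size: keep each line with probability sample_rate.
        if sample_rate < 1.0 and rng.random() >= sample_rate:
            continue
        # Case: lowercasing merges variants such as "The" and "the".
        if lowercase:
            line = line.lower()
        tokens = line.split()
        # N-gram order: collect every n-gram of length 1..order.
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Pruning: keep only n-grams seen at least min_count times.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


corpus = [
    "The cat sat on the mat",
    "the cat sat on the sofa",
    "The dog sat on the mat",
]

full = count_ngrams(corpus, order=3)
lowered = count_ngrams(corpus, order=3, lowercase=True)
bigram = count_ngrams(corpus, order=2)
pruned = count_ngrams(corpus, order=3, min_count=2)

print("baseline:", len(full))
print("ignore case:", len(lowered))
print("order 2:", len(bigram))
print("min_count 2:", len(pruned))
```

Each of the three alternative settings keeps fewer distinct n-grams than the baseline, which is the effect the variables above are meant to exploit. Lowering sample_rate below 1.0 reduces the amount of text counted in the same way.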